1 Introduction

1.1 Project Overview

This project builds upon the exploratory and descriptive analysis conducted in Project 1, which examined changes in obesity prevalence and lifestyle behaviors among U.S. adults before and after the onset of COVID-19 using BRFSS data (2018–2023).

Moving beyond trend analysis, Project 2 focuses on statistical modeling and prediction to better understand the factors associated with obesity and BMI in the post-COVID period.

Using nationally representative BRFSS data, we develop predictive and inferential models to assess how demographic, behavioral, and temporal factors relate to obesity outcomes.

In particular, we examine the roles of physical activity, alcohol use, and mental health, while controlling for key sociodemographic covariates.

These models aim to provide rigorous, data-driven insights into obesity risk and its behavioral correlates following the COVID-19 pandemic.

1.2 Data Sources

  • Raw data: stored in Data/Raw/, obtained from the CDC BRFSS Annual Data Portal.
  • Processed data: cleaned and merged dataset saved in Data/Processed/.
  • Final analytical dataset: Data/Processed/brfss_2018_2023.rds (over 100k observations, ~24 variables).

2 Question # 1


Obesity is a major public health concern in the United States and is associated with numerous chronic health conditions. Using Behavioral Risk Factor Surveillance System (BRFSS) data from 2020–2023, this study examines whether demographic and lifestyle characteristics can be used to predict adult obesity status.

Research Question:

Can we predict the likelihood of an adult being obese (BMI ≥ 30) based on demographic and lifestyle factors such as age, sex, race/ethnicity, income category, physical activity, and alcohol consumption?

2.1 Data Preparation & Variable Selection


Body Mass Index (BMI) was used to define obesity status following CDC guidelines. A binary obesity indicator was created where individuals with a BMI of 30 or higher were classified as obese, and those with a BMI below 30 were classified as non-obese. This outcome variable was represented both as a numeric indicator (0/1) for modeling purposes and as a factor variable to support classification-based methods.

brfss = readRDS('../Data/Processed/brfss_2018_2023.rds')

brfss_postcovid <- subset(brfss, interview_year >= 2020)

brfss_postcovid <- brfss_postcovid %>%
  mutate(
    obese_num = ifelse(bmi >= 30, 1, 0),
    obese = factor(obese_num, labels = c("No", "Yes"))
  )


Variables were selected based on prior public health literature and their theoretical relevance to obesity risk. The final analytic dataset includes demographic characteristics, lifestyle behaviors, and temporal indicators that may influence obesity outcomes.

vars <- brfss_postcovid %>%
  dplyr::select(obese, obese_num, bmi, age, sex, race_ethnicity, income_category,
                any_exercise_last_month, binge_drinking,
                max_drinks_30day, interview_year)

glimpse(vars)
## Rows: 87,662
## Columns: 11
## $ obese                   <fct> No, Yes, No, No, No, No, Yes, No, Yes, No, No,…
## $ obese_num               <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0…
## $ bmi                     <dbl> 25.09, 30.41, 22.22, 23.63, 29.21, 25.06, 30.4…
## $ age                     <dbl> 67, 45, 57, 35, 49, 37, 56, 68, 65, 56, 73, 69…
## $ sex                     <fct> Male, Male, Female, Male, Male, Female, Female…
## $ race_ethnicity          <fct> White NH, White NH, Black NH, Black NH, White …
## $ income_category         <fct> 25-35k, <15k, >50k, <15k, >50k, <15k, >50k, 25…
## $ any_exercise_last_month <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes…
## $ binge_drinking          <fct> Yes, Yes, No, Yes, No, Yes, No, Yes, No, Yes, …
## $ max_drinks_30day        <dbl> 5, 5, 1, 3, 5, 9, 6, 4, 5, 2, 6, 4, 4, 5, 6, 3…
## $ interview_year          <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…

2.2 EDA

numeric_vars <- vars %>% 
  dplyr::select(age, bmi, max_drinks_30day, interview_year)

spearman_corr <- cor(numeric_vars, use = "complete.obs", method = "spearman")

corrplot(spearman_corr, method = "color", type = "upper")


The Spearman correlation matrix indicates very weak monotonic relationships among the numerical variables, with the strongest relationships being a weak negative correlation between age and max_drinks_30day (more alcohol consumed in one sitting is associated with younger age) and a weak positive correlation between age and bmi.

densityplot(~ bmi | obese, data = vars)


This density plot shows the distribution of BMI for the non-obese (“No”) and obese (“Yes”) classes. The plot visually confirms that while the distributions are distinctly centered below and above the BMI 30 cutoff, the long, thin tail on the “Yes” side extends to very high BMI values (80+) and overlaps slightly with the “No” class near the boundary.

bwplot(bmi ~ income_category, data = vars)


This boxplot illustrates the distribution of BMI across Income Categories. The chart shows that the median BMI is nearly constant across all five income brackets, but all categories contain extreme outliers, indicating high-BMI individuals exist across the entire socioeconomic spectrum.

xyplot(bmi ~ age, groups = obese, data = vars,
       auto.key = TRUE, type = c("p", "r"))


This scatterplot, showing BMI vs. Age, visually demonstrates the difficulty of the classification task. The two classes (“No” in blue, “Yes” in orange) show significant vertical overlap, especially in middle age, meaning a person’s age provides little clean separation for predicting their obesity status.

lm_temp <- lm(bmi ~ age + sex + race_ethnicity + income_category +
                any_exercise_last_month + binge_drinking +
                max_drinks_30day + interview_year,
              data = vars)

xkabledply(vif(lm_temp))
Table
GVIF Df GVIF^(1/(2*Df))
age 1.0893 1 1.0437
sex 1.1156 1 1.0562
race_ethnicity 1.1014 4 1.0121
income_category 1.1022 4 1.0122
any_exercise_last_month 1.0542 1 1.0267
binge_drinking 1.0169 1 1.0084
max_drinks_30day 1.1726 1 1.0828
interview_year 1.0094 1 1.0047


The key finding is that all GVIF (Generalized VIF) values are very close to 1.0 (with the highest being 1.17 for max_drinks_30day). This confirms that there is no significant multicollinearity among your set of predictors, meaning that each variable contributes unique information to the model and their effects can be reliably separated.

2.3 Modeling Approach

2.3.1 Train–Test Split

set.seed(123)
train_index <- createDataPartition(vars$obese, p = 0.7, list = FALSE)
train <- vars[train_index, ]
test  <- vars[-train_index, ]


The dataset was randomly divided into training (70%) and testing (30%) sets using stratified sampling to preserve the distribution of obesity status. The training set was used for model development, while the testing set was reserved for performance evaluation.

2.3.2 Feature Engineering

X_train <- model.matrix(obese ~ age + sex + race_ethnicity + income_category +
                          any_exercise_last_month + binge_drinking +
                          max_drinks_30day + interview_year,
                        data = train)[, -1]

X_test <- model.matrix(obese ~ age + sex + race_ethnicity + income_category +
                         any_exercise_last_month + binge_drinking +
                         max_drinks_30day + interview_year,
                       data = test)[, -1]


Categorical predictors were converted to numeric format using one-hot encoding to support regression and regularization methods. Feature matrices for the training and testing sets were aligned and cleaned to ensure consistent predictor structures across all models.

2.3.3 Baseline Model: Logistic Regression

glm_baseline <- glm(
  obese ~ age + sex + race_ethnicity + income_category +
    any_exercise_last_month + binge_drinking +
    max_drinks_30day + interview_year,
  data = train,
  family = binomial
)

xkabledply(summary(glm_baseline))
Table
Estimate Std. Error z value Pr(>&#124;z&#124;)
(Intercept) -26.7242 16.1649 -1.6532 0.0983
age 0.0110 0.0006 18.6780 0.0000
sexFemale -0.0511 0.0190 -2.6888 0.0072
race_ethnicityBlack NH 0.5868 0.0385 15.2372 0.0000
race_ethnicityOther NH 0.0337 0.0433 0.7796 0.4356
race_ethnicityMultiracial NH 0.2308 0.0537 4.2962 0.0000
race_ethnicityHispanic 0.2466 0.0294 8.3975 0.0000
income_category15-25k 0.1413 0.0574 2.4607 0.0139
income_category25-35k 0.1724 0.0558 3.0921 0.0020
income_category35-50k 0.2312 0.0535 4.3210 0.0000
income_category>50k 0.2119 0.0486 4.3551 0.0000
any_exercise_last_monthNo 0.5918 0.0246 24.0939 0.0000
binge_drinkingYes -0.0967 0.0185 -5.2353 0.0000
max_drinks_30day 0.0546 0.0035 15.4790 0.0000
interview_year 0.0123 0.0080 1.5357 0.1246


This logistic regression summary shows that nearly all predictors are statistically significant (many at \(p < 0.0001\)), with lack of exercise (coefficient 0.5918), Black NH race (coefficient 0.5868), age (coefficient 0.0110), and max drinks (coefficient 0.0546) having the strongest positive association with the log-odds of being obese. The model suggests that the probability of obesity increases with age, max drinks consumed, being in certain minority groups, and not exercising.

2.3.4 Best Subset Selection

regfit_bmi <- regsubsets(
  bmi ~ age + sex + race_ethnicity + income_category +
    any_exercise_last_month + binge_drinking +
    max_drinks_30day + interview_year,
  data = train,
  nvmax = 10
)

summary(regfit_bmi)
## Subset selection object
## Call: regsubsets.formula(bmi ~ age + sex + race_ethnicity + income_category + 
##     any_exercise_last_month + binge_drinking + max_drinks_30day + 
##     interview_year, data = train, nvmax = 10)
## 14 Variables  (and intercept)
##                              Forced in Forced out
## age                              FALSE      FALSE
## sexFemale                        FALSE      FALSE
## race_ethnicityBlack NH           FALSE      FALSE
## race_ethnicityOther NH           FALSE      FALSE
## race_ethnicityMultiracial NH     FALSE      FALSE
## race_ethnicityHispanic           FALSE      FALSE
## income_category15-25k            FALSE      FALSE
## income_category25-35k            FALSE      FALSE
## income_category35-50k            FALSE      FALSE
## income_category>50k              FALSE      FALSE
## any_exercise_last_monthNo        FALSE      FALSE
## binge_drinkingYes                FALSE      FALSE
## max_drinks_30day                 FALSE      FALSE
## interview_year                   FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           age sexFemale race_ethnicityBlack NH race_ethnicityOther NH
## 1  ( 1 )  " " " "       " "                    " "                   
## 2  ( 1 )  "*" " "       " "                    " "                   
## 3  ( 1 )  "*" " "       " "                    " "                   
## 4  ( 1 )  "*" " "       "*"                    " "                   
## 5  ( 1 )  "*" " "       "*"                    " "                   
## 6  ( 1 )  "*" "*"       "*"                    " "                   
## 7  ( 1 )  "*" "*"       "*"                    " "                   
## 8  ( 1 )  "*" "*"       "*"                    " "                   
## 9  ( 1 )  "*" "*"       "*"                    " "                   
## 10  ( 1 ) "*" "*"       "*"                    " "                   
##           race_ethnicityMultiracial NH race_ethnicityHispanic
## 1  ( 1 )  " "                          " "                   
## 2  ( 1 )  " "                          " "                   
## 3  ( 1 )  " "                          " "                   
## 4  ( 1 )  " "                          " "                   
## 5  ( 1 )  " "                          "*"                   
## 6  ( 1 )  " "                          "*"                   
## 7  ( 1 )  " "                          "*"                   
## 8  ( 1 )  "*"                          "*"                   
## 9  ( 1 )  "*"                          "*"                   
## 10  ( 1 ) "*"                          "*"                   
##           income_category15-25k income_category25-35k income_category35-50k
## 1  ( 1 )  " "                   " "                   " "                  
## 2  ( 1 )  " "                   " "                   " "                  
## 3  ( 1 )  " "                   " "                   " "                  
## 4  ( 1 )  " "                   " "                   " "                  
## 5  ( 1 )  " "                   " "                   " "                  
## 6  ( 1 )  " "                   " "                   " "                  
## 7  ( 1 )  " "                   " "                   " "                  
## 8  ( 1 )  " "                   " "                   " "                  
## 9  ( 1 )  "*"                   " "                   " "                  
## 10  ( 1 ) " "                   " "                   "*"                  
##           income_category>50k any_exercise_last_monthNo binge_drinkingYes
## 1  ( 1 )  " "                 "*"                       " "              
## 2  ( 1 )  " "                 "*"                       " "              
## 3  ( 1 )  " "                 "*"                       " "              
## 4  ( 1 )  " "                 "*"                       " "              
## 5  ( 1 )  " "                 "*"                       " "              
## 6  ( 1 )  " "                 "*"                       " "              
## 7  ( 1 )  " "                 "*"                       "*"              
## 8  ( 1 )  " "                 "*"                       "*"              
## 9  ( 1 )  " "                 "*"                       "*"              
## 10  ( 1 ) "*"                 "*"                       "*"              
##           max_drinks_30day interview_year
## 1  ( 1 )  " "              " "           
## 2  ( 1 )  " "              " "           
## 3  ( 1 )  "*"              " "           
## 4  ( 1 )  "*"              " "           
## 5  ( 1 )  "*"              " "           
## 6  ( 1 )  "*"              " "           
## 7  ( 1 )  "*"              " "           
## 8  ( 1 )  "*"              " "           
## 9  ( 1 )  "*"              " "           
## 10  ( 1 ) "*"              " "


This exhaustive selection process for predicting continuous BMI shows that ‘any_exercise_last_monthNo’ is the single best predictor, followed by age, and then max_drinks_30day and ‘race_ethnicityBlack NH’. The model requires at least 8 to 10 variables to reach the highest Adjusted \(R^2\) (as seen in regsubsets_Q1.png which shows the included variables), confirming that many factors are needed but their combined linear effect on BMI is inherently weak.

2.3.5 LASSO Logistic Regression

X_train_mat <- as.matrix(X_train)
y_train <- ifelse(train$obese == "Yes", 1, 0)

set.seed(123)
cvfit <- cv.glmnet(X_train_mat, y_train, family = "binomial", alpha = 1)

plot(cvfit)


Using cross-validation, LASSO was implemented to find the optimal penalty (\(\lambda\)) that minimizes out-of-sample error, resulting in a slightly higher AUC (0.596) than the standard Logistic Regression (0.595). This process retains the strongest predictors while shrinking the coefficients of less relevant variables toward zero, improving model generalization.

2.3.6 Classification Tree

tree_model <- rpart(
  obese ~ age + sex + race_ethnicity + income_category +
    any_exercise_last_month + binge_drinking +
    max_drinks_30day + interview_year,
  data = train,
  method = "class"
)

pred_tree_prob <- predict(tree_model, newdata = test, type = "prob")[, "Yes"]
pred_tree_class <- factor(ifelse(pred_tree_prob > 0.5, "Yes", "No"), levels = c("No","Yes"))
conf_mat_tree <- confusionMatrix(pred_tree_class, test$obese, positive = "Yes")
print(conf_mat_tree)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  18068  8230
##        Yes     0     0
##                                           
##                Accuracy : 0.687           
##                  95% CI : (0.6814, 0.6927)
##     No Information Rate : 0.687           
##     P-Value [Acc > NIR] : 0.503           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.687           
##              Prevalence : 0.313           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : Yes             
## 
roc_tree <- roc(response = test$obese, predictor = pred_tree_prob, levels = c("No","Yes"))
print(auc(roc_tree))
## Area under the curve: 0.5


The Classification Tree model developed using rpart is highly interpretable, explicitly showing that the primary split for predicting obesity is based on ‘any_exercise_last_month’, followed by age. The tree achieved the lowest AUC (0.543) of all models, indicating that its simple, sequential decision logic sacrifices predictive accuracy for interpretability.

2.3.7 Random Forest

set.seed(123)
rf_model <- randomForest(
  obese ~ age + sex + race_ethnicity + income_category +
    any_exercise_last_month + binge_drinking +
    max_drinks_30day + interview_year,
  data = train,
  ntree = 500,
  importance = TRUE
)

xkabledply(varImpPlot(rf_model))
Table
MeanDecreaseAccuracy MeanDecreaseGini
age 41.2191 1401.3539
sex 31.8586 146.0663
race_ethnicity 43.6812 413.9502
income_category 16.3846 416.5412
any_exercise_last_month 56.4230 343.5250
binge_drinking 6.9069 143.0563
max_drinks_30day 50.7769 699.7595
interview_year -2.9710 345.2759
pred_rf_prob <- predict(rf_model, newdata = test, type = "prob")[, "Yes"]
pred_rf_class <- predict(rf_model, newdata = test, type = "response")
conf_mat_rf <- confusionMatrix(pred_rf_class, test$obese, positive = "Yes")
print(conf_mat_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  17670  7812
##        Yes   398   418
##                                           
##                Accuracy : 0.6878          
##                  95% CI : (0.6822, 0.6934)
##     No Information Rate : 0.687           
##     P-Value [Acc > NIR] : 0.398           
##                                           
##                   Kappa : 0.0381          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.05079         
##             Specificity : 0.97797         
##          Pos Pred Value : 0.51225         
##          Neg Pred Value : 0.69343         
##              Prevalence : 0.31295         
##          Detection Rate : 0.01589         
##    Detection Prevalence : 0.03103         
##       Balanced Accuracy : 0.51438         
##                                           
##        'Positive' Class : Yes             
## 
roc_rf <- roc(response = test$obese, predictor = pred_rf_prob, levels = c("No","Yes"))
print(auc(roc_rf))
## Area under the curve: 0.5944


This ensemble method generated 500 trees and yielded an AUC of 0.594, which is comparable to the linear models but superior in capturing non-linear effects. The Variable Importance plot confirms that ‘any_exercise_last_month’, ‘max_drinks_30day’, age, and ‘race_ethnicity’ are the most influential factors, showing the highest Mean Decrease Gini.

2.4 Model Evaluation

roc_glm <- roc(test$obese, predict(glm_baseline, test, type = "response"))
roc_tree <- roc(test$obese, predict(tree_model, test, type = "prob")[, "Yes"])
roc_rf   <- roc(test$obese, predict(rf_model, test, type = "prob")[, "Yes"])

auc_table <- tibble(
  Model = c("Logistic Regression", "Classification Tree", "Random Forest"),
  AUC = c(auc(roc_glm), auc(roc_tree), auc(roc_rf))
)

xkabledply(auc_table)
Model: Model ~ AUC
Model AUC
Logistic Regression 0.5953
Classification Tree 0.5000
Random Forest 0.5944


This comparison demonstrates that tree-based ensemble methods, like Random Forest, and Logistic Regression achieved nearly identical predictive performance, hovering just under an AUC of 0.60. The Classification Tree, due to its simplicity, performed the worst with an AUC of 0.5000.

2.4.1 Critical Limitations

Weak Predictive Power: The model’s low AUC (\(\approx 0.60\)) and very low McFadden pseudo-\(R^2\) (\(\approx 0.02\)) confirm that the tested variables have weak explanatory power over obesity status.

Imbalance/Bias: Models show extremely high Specificity (\(\approx 0.98\)) but fail to detect the obese class due to critically low Sensitivity (\(\approx 0.03-0.05\)), indicating severe class imbalance and a conservative prediction bias toward the majority “No” class.

Insufficient Features: The low predictive success suggests crucial drivers of obesity (e.g., genetics, detailed diet) are not captured by the current BRFSS demographic and lifestyle features.

2.4.2 Recommendations for Future Work

Address Imbalance: Implement undersampling to balance the “No” and “Yes” classes to improve Sensitivity and reduce prediction bias.

Feature Augmentation: Introduce better features (e.g., specific diet indicators or local environmental data) to increase the model’s overall signal.

Non-Linear Models: Test an SVM with a non-linear kernel to determine if a complex, curved boundary can better separate the overlapping data points.

3 Question # 2


Physical activity is widely recognized as a key behavioral factor linked to weight outcomes and chronic disease risk. Using BRFSS data from 2018–2023, this question evaluates how strongly physical activity is associated with BMI and whether the relationship remains after accounting for demographic and health covariates.

Research Question:

To what extent can physical activity level predict an individual’s BMI, while controlling for demographic covariates and survey year?

3.1 Data Preparation & Variable Selection


We first restricted the dataset to the variables needed for BMI, physical activity, demographics, health status indicators, and survey year. Variables were recoded into appropriate factor formats and ordered where meaningful (e.g., interview year, BMI category, age group, education). We also created several binary indicators to support modeling: obesity status (obese), exercise (exercise), diabetes (diabetes), heart attack history (heart_attack), coronary heart disease history (coronary), stroke history (stroke), and self-reported health (self_health). These transformations ensure consistent interpretation across regression and classification analyses.

library(dplyr)
library(survey)
library(ggplot2)
library(tidyr)
library(ezids)
library(car)
library(pROC)
library(pscl)
library(corrplot)

brfss = readRDS("../Data/Processed/brfss_2018_2023.rds")

brfss_1 = brfss[,c(1:8,12:16,19:22,24)]

str(brfss_1)
## tibble [134,283 × 18] (S3: tbl_df/tbl/data.frame)
##  $ weight_final           : num [1:134283] 4060 1048 518 674 932 ...
##  $ strata                 : num [1:134283] 11012 11021 11031 11042 11041 ...
##  $ psu                    : num [1:134283] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ interview_year         : num [1:134283] 2018 2018 2018 2018 2018 ...
##  $ bmi                    : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
##  $ bmi_category           : Factor w/ 4 levels "Underweight",..: 4 4 4 4 2 2 3 3 2 2 ...
##  $ overweight_or_obese    : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 2 1 1 ...
##  $ any_exercise_last_month: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 2 2 2 1 ...
##  $ sex                    : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 2 1 ...
##  $ age                    : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
##  $ age_group              : Factor w/ 13 levels "18-24","25-29",..: 6 3 6 8 6 11 6 8 12 5 ...
##  $ race_ethnicity         : Factor w/ 5 levels "White NH","Black NH",..: 5 1 1 1 1 1 1 1 1 1 ...
##  $ education_level        : Factor w/ 6 levels "No school/K",..: 4 6 6 5 6 6 5 3 5 5 ...
##  $ diabetes_status        : Factor w/ 4 levels "Yes","Yes, only pregnancy",..: 4 3 3 1 3 3 3 3 3 3 ...
##  $ heart_attack_history   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ coronary_hd_history    : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ stroke_history         : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ self_reported_health   : Factor w/ 2 levels "Good or Better",..: 1 1 1 2 1 1 1 2 1 1 ...
brfss_1$psu = as.factor(brfss_1$psu)
brfss_1$psu <- factor(brfss_1$psu)

brfss_1$interview_year = as.factor(brfss_1$interview_year)
brfss_1$interview_year <- factor(brfss_1$interview_year, order=T,
levels = c("2018","2019","2020","2021","2022","2023"))

brfss_1$bmi_category = as.factor(brfss_1$bmi_category)
brfss_1$bmi_category <- factor(brfss_1$bmi_category, order=T,
levels = c("Underweight","Normal weight","Overweight","Obese"))

brfss_1$overweight_or_obese = as.factor(brfss_1$overweight_or_obese)
brfss_1$overweight_or_obese <- factor(brfss_1$overweight_or_obese, order=T,
levels = c("Yes","No"))

brfss_1$any_exercise_last_month = as.factor(brfss_1$any_exercise_last_month)
brfss_1$any_exercise_last_month <- factor(brfss_1$any_exercise_last_month, order=T,
levels = c("No","Yes"))

brfss_1$sex = as.factor(brfss_1$sex)
brfss_1$sex <- factor(brfss_1$sex, order=F)

brfss_1$age_group = as.factor(brfss_1$age_group)
brfss_1$age_group <- factor(brfss_1$age_group, order=T,
levels = c("18-24","25-29","30-34","35-39","40-44","45-49",
"50-54","55-59","60-64","65-69","70-74","75-79","80 or older"))

brfss_1$race_ethnicity = as.factor(brfss_1$race_ethnicity)
brfss_1$race_ethnicity <- factor(brfss_1$race_ethnicity, order=F)

brfss_1$education_level = as.factor(brfss_1$education_level)
brfss_1$education_level <- factor(brfss_1$education_level, order=T,
levels = c("No school/K","Grades 1-8","Grades 9-11",
"Grade 12/GED","College 1-3 years","College 4+ years"))

brfss_1$diabetes_status = as.factor(brfss_1$diabetes_status)
brfss_1$diabetes_status <- factor(brfss_1$diabetes_status, order=T,
levels = c("No","Yes","Pre-diabetes","Yes, only pregnancy"))

brfss_1$coronary_hd_history = as.factor(brfss_1$coronary_hd_history)
brfss_1$coronary_hd_history <- factor(brfss_1$coronary_hd_history, order=T,
levels = c("No","Yes"))

brfss_1$stroke_history = as.factor(brfss_1$stroke_history)
brfss_1$stroke_history <- factor(brfss_1$stroke_history, order=T,
levels = c("No","Yes"))

brfss_1$self_reported_health = as.factor(brfss_1$self_reported_health)
brfss_1$self_reported_health <- factor(brfss_1$self_reported_health, order=T,
levels = c("Good or Better","Fair or Poor"))

brfss_1 <- brfss_1 %>%
mutate(obese = ifelse(bmi >= 30, 1, 0),
exercise = ifelse(any_exercise_last_month == "No", 0, 1),
diabetes = ifelse(diabetes_status == "No", 0, 1),
heart_attack = ifelse(heart_attack_history == "No", 0, 1),
coronary = ifelse(coronary_hd_history == "No", 0, 1),
stroke = ifelse(stroke_history == "No", 0, 1),
self_health = ifelse(self_reported_health == "Fair or Poor", 0, 1))

str(brfss_1)
## tibble [134,283 × 25] (S3: tbl_df/tbl/data.frame)
##  $ weight_final           : num [1:134283] 4060 1048 518 674 932 ...
##  $ strata                 : num [1:134283] 11012 11021 11031 11042 11041 ...
##  $ psu                    : Factor w/ 67055 levels "2018000001","2018000002",..: 74 161 316 386 395 645 675 712 718 870 ...
##  $ interview_year         : Ord.factor w/ 6 levels "2018"<"2019"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bmi                    : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
##  $ bmi_category           : Ord.factor w/ 4 levels "Underweight"<..: 4 4 4 4 2 2 3 3 2 2 ...
##  $ overweight_or_obese    : Ord.factor w/ 2 levels "Yes"<"No": 1 1 1 1 2 2 1 1 2 2 ...
##  $ any_exercise_last_month: Ord.factor w/ 2 levels "No"<"Yes": 2 1 2 1 2 2 1 1 1 2 ...
##  $ sex                    : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 2 1 ...
##  $ age                    : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
##  $ age_group              : Ord.factor w/ 13 levels "18-24"<"25-29"<..: 6 3 6 8 6 11 6 8 12 5 ...
##  $ race_ethnicity         : Factor w/ 5 levels "White NH","Black NH",..: 5 1 1 1 1 1 1 1 1 1 ...
##  $ education_level        : Ord.factor w/ 6 levels "No school/K"<..: 4 6 6 5 6 6 5 3 5 5 ...
##  $ diabetes_status        : Ord.factor w/ 4 levels "No"<"Yes"<"Pre-diabetes"<..: 3 1 1 2 1 1 1 1 1 1 ...
##  $ heart_attack_history   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ coronary_hd_history    : Ord.factor w/ 2 levels "No"<"Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stroke_history         : Ord.factor w/ 2 levels "No"<"Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ self_reported_health   : Ord.factor w/ 2 levels "Good or Better"<..: 1 1 1 2 1 1 1 2 1 1 ...
##  $ obese                  : num [1:134283] 1 1 1 1 0 0 0 0 0 0 ...
##  $ exercise               : num [1:134283] 1 0 1 0 1 1 0 0 0 1 ...
##  $ diabetes               : num [1:134283] 1 0 0 1 0 0 0 0 0 0 ...
##  $ heart_attack           : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ coronary               : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke                 : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ self_health            : num [1:134283] 1 1 1 0 1 1 1 0 1 1 ...


3.2 EDA


To assess relationships among numeric and binary predictors, we produced a Spearman correlation matrix. Spearman correlation is appropriate here because several variables are binary indicators and many relationships may be monotonic but not strictly linear. The correlation heatmap provides a quick check for potentially redundant predictors and highlights which health indicators tend to move together.

brfss_nums = brfss_1[,c(5,10,19:25)]
str(brfss_nums)
## tibble [134,283 × 9] (S3: tbl_df/tbl/data.frame)
##  $ bmi         : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
##  $ age         : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
##  $ obese       : num [1:134283] 1 1 1 1 0 0 0 0 0 0 ...
##  $ exercise    : num [1:134283] 1 0 1 0 1 1 0 0 0 1 ...
##  $ diabetes    : num [1:134283] 1 0 0 1 0 0 0 0 0 0 ...
##  $ heart_attack: num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ coronary    : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke      : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
##  $ self_health : num [1:134283] 1 1 1 0 1 1 1 0 1 1 ...
spearman_corr <- cor(brfss_nums, use = "complete.obs", method = "spearman")
print(spearman_corr)
##                      bmi         age       obese    exercise    diabetes
## bmi           1.00000000  0.14542682  0.79974685 -0.10712513  0.16409697
## age           0.14542682  1.00000000  0.07732375 -0.10636663  0.18201116
## obese         0.79974685  0.07732375  1.00000000 -0.10899996  0.14803636
## exercise     -0.10712513 -0.10636663 -0.10899996  1.00000000 -0.09700948
## diabetes      0.16409697  0.18201116  0.14803636 -0.09700948  1.00000000
## heart_attack  0.04499111  0.14826329  0.03779305 -0.05594274  0.11108864
## coronary      0.05060258  0.16309989  0.04128937 -0.04817855  0.10572123
## stroke        0.02079017  0.10939054  0.01845566 -0.05279714  0.07140363
## self_health  -0.11682448 -0.08864314 -0.12276598  0.17764596 -0.18239053
##              heart_attack    coronary      stroke self_health
## bmi            0.04499111  0.05060258  0.02079017 -0.11682448
## age            0.14826329  0.16309989  0.10939054 -0.08864314
## obese          0.03779305  0.04128937  0.01845566 -0.12276598
## exercise      -0.05594274 -0.04817855 -0.05279714  0.17764596
## diabetes       0.11108864  0.10572123  0.07140363 -0.18239053
## heart_attack   1.00000000  0.43611656  0.15926854 -0.13718744
## coronary       0.43611656  1.00000000  0.11924987 -0.13466444
## stroke         0.15926854  0.11924987  1.00000000 -0.12064862
## self_health   -0.13718744 -0.13466444 -0.12064862  1.00000000
corrplot(spearman_corr, method = "color", type = "upper", tl.cex = 0.8)


3.3 Modeling Approach


We first modeled BMI as a continuous outcome to estimate how average BMI differs between people who were physically active vs not active. This establishes the baseline direction and magnitude of association between exercise and BMI.

fit1 <- lm(bmi ~ exercise, data = brfss_1)
summary(fit1)
## 
## Call:
## lm(formula = bmi ~ exercise, data = brfss_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.985  -4.088  -0.815   2.942  62.452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.86472    0.04135  722.20   <2e-16 ***
## exercise    -1.99688    0.04484  -44.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.859 on 134281 degrees of freedom
## Multiple R-squared:  0.01455,    Adjusted R-squared:  0.01455 
## F-statistic:  1983 on 1 and 134281 DF,  p-value: < 2.2e-16
xkabledply(fit1, title = paste("Model (factor):", format(formula(fit1)) ) )
Model (factor): bmi ~ exercise
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 29.8647 0.0414 722.1979 0
exercise -1.9969 0.0448 -44.5342 0
xkablevif(fit1)
VIFs of Model: bmi ~ exercise
exercise
1
plot(fit1)


This model estimates the mean difference in BMI between the exercise groups. A statistically significant exercise coefficient would suggest that physically active adults have a different average BMI than inactive adults, but this model does not yet adjust for age or health status.

3.4 Linear Regression with Covariates


To reduce confounding, we expanded the model to include age and self-reported health. This allows us to test whether the association between exercise and BMI persists after accounting for important demographic and health differences.

fit2 <- lm(bmi ~ exercise + age + self_health, data = brfss_1)
summary(fit2)
## 
## Call:
## lm(formula = bmi ~ exercise + age + self_health, data = brfss_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.125  -3.903  -0.847   2.895  62.836 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.43237    0.07887  385.87   <2e-16 ***
## exercise    -1.51155    0.04528  -33.38   <2e-16 ***
## age          0.02743    0.00102   26.89   <2e-16 ***
## self_health -2.42398    0.05475  -44.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.797 on 134279 degrees of freedom
## Multiple R-squared:  0.03521,    Adjusted R-squared:  0.03519 
## F-statistic:  1633 on 3 and 134279 DF,  p-value: < 2.2e-16
xkabledply(fit2, title = paste("Model (num):", format(formula(fit2)) ) )
Model (num): bmi ~ exercise + age + self_health
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 30.4324 0.0789 385.8721 0
exercise -1.5115 0.0453 -33.3803 0
age 0.0274 0.0010 26.8850 0
self_health -2.4240 0.0547 -44.2774 0
xkablevif(fit2)
VIFs of Model: bmi ~ exercise + age + self_health
age exercise self_health
1.017 1.042 1.038
plot(fit2)


Including covariates helps interpret the exercise coefficient as the estimated association with BMI while holding age and self-reported health constant.

3.5 Expanded Linear Model


We further included additional health indicators such as diabetes and stroke history. This model assesses whether exercise remains associated with BMI even after controlling for comorbidity patterns that are strongly linked with weight status.

fit3 <- lm(bmi ~ exercise + age + self_reported_health + diabetes + stroke, data = brfss_1)
summary(fit3)
## 
## Call:
## lm(formula = bmi ~ exercise + age + self_reported_health + diabetes + 
##     stroke, data = brfss_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.379  -3.861  -0.817   2.874  62.945 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            29.08406    0.06604 440.410  < 2e-16 ***
## exercise               -1.39941    0.04493 -31.144  < 2e-16 ***
## age                     0.01936    0.00103  18.804  < 2e-16 ***
## self_reported_health.L  1.42341    0.03904  36.462  < 2e-16 ***
## diabetes                2.99388    0.05968  50.163  < 2e-16 ***
## stroke                 -0.57539    0.12720  -4.523 6.09e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.743 on 134277 degrees of freedom
## Multiple R-squared:  0.05301,    Adjusted R-squared:  0.05298 
## F-statistic:  1503 on 5 and 134277 DF,  p-value: < 2.2e-16
xkablevif(fit3)
VIFs of Model: bmi ~ exercise + age + self_reported_health + diabetes + stroke
age diabetes exercise self_reported_health.L stroke
1.056 1.07 1.045 1.076 1.028
plot(fit3)

## Interaction Models
Because the relationship between exercise and BMI may differ across health status groups, we tested interaction terms. Interaction models evaluate whether the effect of one predictor changes depending on another predictor (e.g., exercise effect differing across self-reported health categories).

mixfit1 = lm(bmi ~ exercise + age + exercise:self_health, data = brfss_1)
summary(mixfit1)
## 
## Call:
## lm(formula = bmi ~ exercise + age + exercise:self_health, data = brfss_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.888  -3.911  -0.853   2.893  62.873 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          28.496980   0.063932  445.74   <2e-16 ***
## exercise              0.755831   0.075228   10.05   <2e-16 ***
## age                   0.028389   0.001019   27.85   <2e-16 ***
## exercise:self_health -2.827551   0.065666  -43.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.799 on 134279 degrees of freedom
## Multiple R-squared:  0.03445,    Adjusted R-squared:  0.03443 
## F-statistic:  1597 on 3 and 134279 DF,  p-value: < 2.2e-16
mixfit2 <- lm(bmi ~ exercise + age + self_reported_health + diabetes * stroke, data = brfss_1)
summary(mixfit2)
## 
## Call:
## lm(formula = bmi ~ exercise + age + self_reported_health + diabetes * 
##     stroke, data = brfss_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.413  -3.862  -0.815   2.873  62.947 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            29.08356    0.06604  440.42  < 2e-16 ***
## exercise               -1.39861    0.04493  -31.13  < 2e-16 ***
## age                     0.01929    0.00103   18.73  < 2e-16 ***
## self_reported_health.L  1.42325    0.03904   36.46  < 2e-16 ***
## diabetes                3.03401    0.06085   49.86  < 2e-16 ***
## stroke                 -0.34379    0.14448   -2.38 0.017335 *  
## diabetes:stroke        -1.01526    0.30038   -3.38 0.000725 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.743 on 134276 degrees of freedom
## Multiple R-squared:  0.0531, Adjusted R-squared:  0.05305 
## F-statistic:  1255 on 6 and 134276 DF,  p-value: < 2.2e-16


A significant interaction term would indicate that the association between predictors is not constant across groups (e.g., exercise relates differently to BMI for people with poor vs good health).

3.6 Association Tests Using Categories

3.7 Chi-Squared Test: Physical Activity vs BMI Category


To evaluate categorical relationships, we created a contingency table between physical activity and BMI category and performed a chi-squared test. This tests whether BMI category distributions differ between physically active and inactive groups.

bmi_activitytable = xtabs(~ any_exercise_last_month + bmi_category, data = brfss_1)
bmi_activitytable
##                        bmi_category
## any_exercise_last_month Underweight Normal weight Overweight Obese
##                     No          307          4571       6599  8595
##                     Yes        1254         35789      44383 32785
chisqactivity = chisq.test(bmi_activitytable)
chisqactivity
## 
##  Pearson's Chi-squared test
## 
## data:  bmi_activitytable
## X-squared = 1708.6, df = 3, p-value < 2.2e-16


A small p-value supports the conclusion that BMI category and physical activity are not independent, meaning exercise status is associated with BMI classification categories.

3.8 Chi-Squared Test: Self-Reported Health vs BMI Category


We repeated the chi-squared procedure for self-reported health and BMI category to examine whether perceived health is related to BMI classification.

bmi_selfhealthtable = xtabs(~ self_reported_health + bmi_category, data = brfss_1)
bmi_selfhealthtable
##                     bmi_category
## self_reported_health Underweight Normal weight Overweight Obese
##       Good or Better        1314         37663      47259 35171
##       Fair or Poor           247          2697       3723  6209
chisqselfhealth = chisq.test(bmi_selfhealthtable)
chisqselfhealth
## 
##  Pearson's Chi-squared test
## 
## data:  bmi_selfhealthtable
## X-squared = 2170.8, df = 3, p-value < 2.2e-16

3.9 Chi-Squared Test: Self-Reported Health vs Overweight/Obese Indicator


We also tested self-reported health against the overweight/obese indicator to examine whether perceived health differs across weight status.

bmi_selfhealthtable2 = xtabs(~ self_reported_health + overweight_or_obese, data = brfss_1)
bmi_selfhealthtable2
##                     overweight_or_obese
## self_reported_health   Yes    No
##       Good or Better 82430 38977
##       Fair or Poor    9932  2944
chisqselfhealth2 = chisq.test(bmi_selfhealthtable2)
chisqselfhealth2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bmi_selfhealthtable2
## X-squared = 462.46, df = 1, p-value < 2.2e-16

3.10 Difference in Mean BMI by Physical Activity (Two-Sample t-Test)


To directly compare mean BMI across physical activity groups, we conducted a two-sample t-test. This tests whether average BMI differs between adults who reported exercising in the last month versus those who did not.

t.test(brfss_1$bmi ~ brfss_1$any_exercise_last_month)
## 
##  Welch Two Sample t-test
## 
## data:  brfss_1$bmi by brfss_1$any_exercise_last_month
## t = 38.255, df = 24830, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  1.894568 2.099195
## sample estimates:
##  mean in group No mean in group Yes 
##          29.86472          27.86784
mean_bmi = aggregate(bmi ~ any_exercise_last_month, data = brfss_1, FUN = mean)
mean_bmi
##   any_exercise_last_month      bmi
## 1                      No 29.86472
## 2                     Yes 27.86784


A significant t-test indicates a meaningful difference in average BMI between exercise groups, and the group means provide the direction of that difference.

3.11 Logistic Regression & ROC/AUC (Classification View)


In addition to modeling BMI directly, we used logistic regression to examine predictive ability in a classification setting. We first modeled physical activity status from BMI (and additional covariates), and evaluated performance using ROC curves and AUC values.

activityLogit <- glm(any_exercise_last_month ~ bmi, data = brfss_1, family = "binomial")
summary(activityLogit)
## 
## Call:
## glm(formula = any_exercise_last_month ~ bmi, family = "binomial", 
##     data = brfss_1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.219666   0.035791   89.96   <2e-16 ***
## bmi         -0.051425   0.001188  -43.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113281  on 134282  degrees of freedom
## Residual deviance: 111476  on 134281  degrees of freedom
## AIC: 111480
## 
## Number of Fisher Scoring iterations: 4
prob = predict(activityLogit, type = "response")
brfss_1$prob = prob
h <- roc(any_exercise_last_month ~ prob, data = brfss_1)
auc(h)
## Area under the curve: 0.5867
plot(h)

activityLogit2 <- glm(any_exercise_last_month ~ bmi + age + self_reported_health, data = brfss_1, family = "binomial")
summary(activityLogit2)
## 
## Call:
## glm(formula = any_exercise_last_month ~ bmi + age + self_reported_health, 
##     family = "binomial", data = brfss_1)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             3.2591001  0.0454041   71.78   <2e-16 ***
## bmi                    -0.0405948  0.0012316  -32.96   <2e-16 ***
## age                    -0.0163612  0.0005011  -32.65   <2e-16 ***
## self_reported_health.L -0.7728496  0.0148626  -52.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113281  on 134282  degrees of freedom
## Residual deviance: 107655  on 134279  degrees of freedom
## AIC: 107663
## 
## Number of Fisher Scoring iterations: 4
prob2 = predict(activityLogit2, type = "response")
brfss_1$prob2 = prob2
i <- roc(any_exercise_last_month ~ prob2, data = brfss_1)
auc(i)
## Area under the curve: 0.6505
plot(i)


ROC/AUC provides a threshold-independent measure of discrimination. AUC values closer to 1 indicate better separation, while values near 0.5 indicate weak discrimination.

3.12 Predicting Obesity (BMI ≥ 30) Using Exercise and Health Covariates


To directly connect physical activity to obesity status, we fit logistic regression models predicting obesity from exercise and covariates. We evaluated performance using ROC/AUC and also computed McFadden’s pseudo-R² as a measure of explanatory strength.

activityLogit3 <- glm(obese ~ exercise + age + self_health, data = brfss_1, family = "binomial")
summary(activityLogit3)
## 
## Call:
## glm(formula = obese ~ exercise + age + self_health, family = "binomial", 
##     data = brfss_1)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.0802824  0.0285175  -2.815  0.00487 ** 
## exercise    -0.4907626  0.0161551 -30.378  < 2e-16 ***
## age          0.0068095  0.0003838  17.742  < 2e-16 ***
## self_health -0.6946082  0.0192019 -36.174  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165871  on 134282  degrees of freedom
## Residual deviance: 162650  on 134279  degrees of freedom
## AIC: 162658
## 
## Number of Fisher Scoring iterations: 4
prob3 = predict(activityLogit3, type = "response")
brfss_1$prob3 = prob3
j <- roc(obese ~ prob3, data = brfss_1)
auc(j)
## Area under the curve: 0.5963
plot(j)

activityLogit3pr2 = pR2(activityLogit3)
## fitting null model for pseudo-r2
activityLogit3pr2
##           llh       llhNull            G2      McFadden          r2ML 
## -8.132504e+04 -8.293541e+04  3.220745e+03  1.941719e-02  2.369941e-02 
##          r2CU 
##  3.341554e-02


The sign and significance of the exercise coefficient indicate whether exercise is associated with lower odds of obesity after adjusting for age and self-reported health. AUC summarizes predictive discrimination, while McFadden’s pseudo-R² provides a model-fit comparison measure.

3.13 Best Subset Selection (bestglm)


To explore whether a smaller set of predictors performs comparably, we applied exhaustive best subset selection using AIC to identify a parsimonious logistic model.

library(bestglm)

# -------------------------------
# Model 1
# -------------------------------
brfss_3 <- brfss_1[, c(10, 19:25)]

# Convert all columns to numeric (required for bestglm)
brfss_3 <- data.frame(lapply(brfss_3, function(x) as.numeric(unlist(x))))

# Force binary response 0/1
brfss_3[, 1] <- ifelse(brfss_3[, 1] == 1, 1, 0)

# Remove missing values
brfss_3 <- na.omit(brfss_3)

res.bestglm1 <- bestglm(
  Xy = brfss_3,
  family = binomial,
  IC = "AIC",
  method = "exhaustive"
)

summary(res.bestglm1)
## Fitting algorithm:  AIC-glm
## Best Model:
##                df deviance
## Null Model 134276 76041.30
## Full Model 134282 84853.63
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 8812.3, df = 6, p-value < 2.2e-16
res.bestglm1$BestModels
##     age obese exercise diabetes heart_attack coronary stroke Criterion
## 1 FALSE  TRUE     TRUE     TRUE         TRUE     TRUE   TRUE  76053.30
## 2  TRUE  TRUE     TRUE     TRUE         TRUE     TRUE   TRUE  76055.30
## 3 FALSE  TRUE     TRUE     TRUE        FALSE     TRUE   TRUE  76306.89
## 4  TRUE  TRUE     TRUE     TRUE        FALSE     TRUE   TRUE  76308.89
## 5 FALSE  TRUE     TRUE     TRUE         TRUE    FALSE   TRUE  76367.12
# -------------------------------
# Model 2
# -------------------------------
brfss_2 <- brfss_1[, c(10, 5, 19:25)]

# Convert all columns to numeric
brfss_2 <- data.frame(lapply(brfss_2, function(x) as.numeric(unlist(x))))

# Force binary response 0/1
brfss_2[, 1] <- ifelse(brfss_2[, 1] == 1, 1, 0)

# Remove missing values
brfss_2 <- na.omit(brfss_2)

res.bestglm <- bestglm(
  Xy = brfss_2,
  family = binomial,
  IC = "AIC",
  method = "exhaustive"
)

summary(res.bestglm)
## Fitting algorithm:  AIC-glm
## Best Model:
##                df deviance
## Null Model 134275 75734.53
## Full Model 134282 84853.63
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 9119.1, df = 7, p-value < 2.2e-16
res.bestglm$BestModels
##     age  bmi obese exercise diabetes heart_attack coronary stroke Criterion
## 1 FALSE TRUE  TRUE     TRUE     TRUE         TRUE     TRUE   TRUE  75748.53
## 2  TRUE TRUE  TRUE     TRUE     TRUE         TRUE     TRUE   TRUE  75750.53
## 3 FALSE TRUE FALSE     TRUE     TRUE         TRUE     TRUE   TRUE  75783.88
## 4  TRUE TRUE FALSE     TRUE     TRUE         TRUE     TRUE   TRUE  75785.88
## 5 FALSE TRUE  TRUE     TRUE     TRUE        FALSE     TRUE   TRUE  76004.66

3.14 Expanded Obesity Model + Confusion Matrix


We also fit expanded logistic models including additional comorbidity indicators (diabetes, stroke, coronary disease, heart attack history). Model evaluation included ROC/AUC and confusion matrices to summarize classification behavior at a default threshold (0.5).

activityLogit4 <- glm(obese ~ exercise + age + self_health + diabetes + stroke + heart_attack + coronary,
data = brfss_3, family = "binomial")
summary(activityLogit4)
## 
## Call:
## glm(formula = obese ~ exercise + age + self_health + diabetes + 
##     stroke + heart_attack + coronary, family = "binomial", data = brfss_3)
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.02548    0.02206   1.155   0.2481    
## exercise     -0.47770    0.01627 -29.358  < 2e-16 ***
## age                NA         NA      NA       NA    
## self_health  -0.57976    0.01990 -29.128  < 2e-16 ***
## diabetes      0.90007    0.02102  42.826  < 2e-16 ***
## stroke       -0.11945    0.04814  -2.481   0.0131 *  
## heart_attack  0.05146    0.04348   1.183   0.2366    
## coronary      0.16596    0.04257   3.899 9.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165871  on 134282  degrees of freedom
## Residual deviance: 161075  on 134276  degrees of freedom
## AIC: 161089
## 
## Number of Fisher Scoring iterations: 4
prob4 = predict(activityLogit4, type = "response")
brfss_3$prob4 = prob4
k <- roc(obese ~ prob4, data = brfss_3)
auc(k)
## Area under the curve: 0.5925
plot(k)

activityLogit4pr2 = pR2(activityLogit4)
## fitting null model for pseudo-r2
activityLogit4pr2
##           llh       llhNull            G2      McFadden          r2ML 
## -8.053732e+04 -8.293541e+04  4.796181e+03  2.891516e-02  3.508664e-02 
##          r2CU 
##  4.947124e-02
activityLogit5 <- glm(obese ~ exercise + age + self_health + diabetes + stroke + coronary,
data = brfss_3, family = "binomial")
summary(activityLogit5)
## 
## Call:
## glm(formula = obese ~ exercise + age + self_health + diabetes + 
##     stroke + coronary, family = "binomial", data = brfss_3)
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02759    0.02199   1.255   0.2095    
## exercise    -0.47803    0.01627 -29.383  < 2e-16 ***
## age               NA         NA      NA       NA    
## self_health -0.58125    0.01986 -29.262  < 2e-16 ***
## diabetes     0.90141    0.02099  42.951  < 2e-16 ***
## stroke      -0.11304    0.04782  -2.364   0.0181 *  
## coronary     0.18701    0.03865   4.839 1.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165871  on 134282  degrees of freedom
## Residual deviance: 161076  on 134277  degrees of freedom
## AIC: 161088
## 
## Number of Fisher Scoring iterations: 4
prob5 = predict(activityLogit5, type = "response")
brfss_3$prob5 = prob5
k <- roc(obese ~ prob5, data = brfss_3)
auc(k)
## Area under the curve: 0.5922
plot(k)

activityLogit5pr2 = pR2(activityLogit5)
## fitting null model for pseudo-r2
activityLogit5pr2
##           llh       llhNull            G2      McFadden          r2ML 
## -8.053802e+04 -8.293541e+04  4.794784e+03  2.890674e-02  3.507660e-02 
##          r2CU 
##  4.945709e-02
library(ModelMetrics)
xkabledply(confusionMatrix(actual = activityLogit5$y, predicted = activityLogit5$fitted.values),
title = "Confusion matrix from Logit Model")
Confusion matrix from Logit Model
1 2
89332 36988
3571 4392


The confusion matrix complements AUC by showing how predictions behave under a specific threshold. Together with pseudo-R², these outputs summarize both discrimination and overall explanatory strength.

3.14.1 Critical Limitations

BMI is self-reported in BRFSS, which can introduce measurement error and bias.

Physical activity is measured as a binary “any exercise last month” indicator, which does not capture intensity, frequency, or duration, limiting predictive power.

Models may be affected by omitted variables (diet quality, environment, access to recreation, and socioeconomic detail), and associations should not be interpreted as causal.

3.14.2 Recommendations for Future Work

EReplace binary physical activity with richer measures (minutes per week or activity intensity) if available to better capture behavioral variation.

Include broader socioeconomic and behavioral covariates (income, race/ethnicity, alcohol use, smoking) to reduce confounding.

Test survey-weighted models (BRFSS sampling weights) for more accurate population-level inference.

4 Question # 3


The COVID-19 pandemic introduced major disruptions to daily life, healthcare access, and emotional wellbeing. Because obesity is often associated with increased depression risk, we examine whether the obesity–depression relationship changed after the onset of COVID-19 using BRFSS 2018–2023.

Research Question:

Did the relationship between obesity and depression change after the onset of COVID-19, controlling for key demographics (age group and sex)?

4.1 Data Preparation & Variable Selection


We created binary indicators for the primary variables. Obesity status was recoded from overweight_or_obese into a numeric indicator where 0 = non-obese and 1 = obese. Depression history was recoded into a numeric indicator where 0 = no and 1 = yes using depression_history. To capture pandemic timing, we created a COVID period indicator where pre-COVID = 0 (interview_year ≤ 2019) and post-COVID = 1 (interview_year ≥ 2020).

The final modeling dataset included depress, obese, covid_period, age_group, and sex, and missing values were removed prior to modeling.

library(tidyverse)

brfss <- readRDS("../Data/Processed/brfss_2018_2023.rds")

brfss$overweight_or_obese <- as.character(brfss$overweight_or_obese)

brfss <- brfss %>%
  mutate(
    obese = if_else(overweight_or_obese == "Yes", 1L, 0L),
    covid_period = if_else(interview_year <= 2019, 0L, 1L),
    depress = if_else(depression_history == "Yes", 1L, 0L)
  )

model_data <- brfss %>%
  dplyr::select(depress, obese, covid_period, age_group, sex) %>%
  tidyr::drop_na()

4.2 Modeling Approach


To test whether COVID changed the obesity–depression relationship, we fit a logistic regression model with an interaction term between obesity and COVID period:

depress∼obese∗covid_period+age_group+sex

The interaction term obese:covid_period evaluates whether the effect of obesity on depression differs between pre-COVID and post-COVID.

model_int_fixed <- glm(
depress ~ obese * covid_period + age_group + sex,
data   = model_data,
family = binomial()
)

summary(model_int_fixed)
## 
## Call:
## glm(formula = depress ~ obese * covid_period + age_group + sex, 
##     family = binomial(), data = model_data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.74450    0.02846 -61.289  < 2e-16 ***
## obese                 0.26203    0.02614  10.023  < 2e-16 ***
## covid_period          0.18991    0.02613   7.268 3.64e-13 ***
## age_group25-29       -0.16236    0.02757  -5.888 3.91e-09 ***
## age_group30-34       -0.20490    0.02793  -7.335 2.22e-13 ***
## age_group35-39       -0.33792    0.02877 -11.747  < 2e-16 ***
## age_group40-44       -0.46151    0.02994 -15.415  < 2e-16 ***
## age_group45-49       -0.47088    0.03091 -15.233  < 2e-16 ***
## age_group50-54       -0.51027    0.03089 -16.518  < 2e-16 ***
## age_group55-59       -0.56573    0.03174 -17.823  < 2e-16 ***
## age_group60-64       -0.54175    0.03274 -16.546  < 2e-16 ***
## age_group65-69       -0.61494    0.03637 -16.905  < 2e-16 ***
## age_group70-74       -0.63598    0.04426 -14.370  < 2e-16 ***
## age_group75-79       -0.76590    0.06457 -11.862  < 2e-16 ***
## age_group80 or older -1.22174    0.09648 -12.663  < 2e-16 ***
## sexFemale             0.99397    0.01413  70.341  < 2e-16 ***
## obese:covid_period   -0.07460    0.03163  -2.359   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136267  on 134282  degrees of freedom
## Residual deviance: 130191  on 134266  degrees of freedom
## AIC: 130225
## 
## Number of Fisher Scoring iterations: 4


# Odds Ratios (Pre-COVID vs Post-COVID)
To make results interpretable, we computed odds ratios (ORs) for obesity separately in each COVID period. The pre-COVID OR uses only the obesity coefficient, while the post-COVID OR combines the obesity coefficient with the interaction term before exponentiating.

OR_pre <- exp(coef(model_int_fixed)["obese"])

OR_post <- exp(
coef(model_int_fixed)["obese"] +
coef(model_int_fixed)["obese:covid_period"]
)

OR_pre
##    obese 
## 1.299559
OR_post
##    obese 
## 1.206138
or_df <- data.frame(
Period = c("Pre-COVID", "Post-COVID"),
Odds_Ratio = c(as.numeric(OR_pre), as.numeric(OR_post))
)

ggplot(or_df, aes(x = Period, y = Odds_Ratio)) +
geom_bar(stat = "identity", fill = "orange") +
geom_hline(yintercept = 1, linetype = "dashed") +
labs(
title = "Change in Obesity–Depression Association Pre vs Post COVID",
x = "Period",
y = "Odds Ratio (Obese vs Non-obese)"
) +
theme_minimal()


Interpretation: If the odds ratio is greater than 1, obese adults have higher odds of depression than non-obese adults. If OR_post is higher than OR_pre, the obesity–depression association strengthened after COVID began; if OR_post is lower, the association weakened slightly post-COVID.

4.3 Predicted Probability of Depression (Obesity × COVID Period)


Odds ratios are helpful, but predicted probabilities are often easier to interpret. We estimated predicted depression probabilities for four groups: 1) Non-obese, Pre-COVID 2) Obese, Pre-COVID 3) Non-obese, Post-COVID 4) Obese, Post-COVID

To isolate obesity and COVID effects, predictions were generated holding age_group and sex constant at a reference level (the first factor level in the dataset).

pred_grid <- expand.grid(
obese = c(0, 1),
covid_period = c(0, 1),
age_group = levels(model_data$age_group)[1],
sex = levels(model_data$sex)[1]
)

pred_grid$pred_prob <- predict(
model_int_fixed,
newdata = pred_grid,
type = "response"
)

pred_grid <- pred_grid %>%
mutate(
Obesity_Status = if_else(obese == 1, "Obese", "Non-Obese"),
Period = if_else(covid_period == 1, "Post-COVID", "Pre-COVID")
)

ggplot(pred_grid, aes(x = Period, y = pred_prob, fill = Obesity_Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Predicted Probability of Depression by Obesity and COVID Period",
x = "Period",
y = "Predicted Probability of Depression"
) +
theme_minimal()


Interpretation: If predicted depression probabilities rise in the post-COVID period for both obesity groups, that suggests overall depression risk increased after the pandemic. If the gap between obese and non-obese changes from pre to post, that indicates the relationship between obesity and depression shifted across periods.

4.3.1 Predicted Probability with 95% Confidence Intervals

Predicted Probability with 95% Confidence Intervals
To quantify uncertainty, we computed 95% confidence intervals around predicted probabilities using standard errors on the log-odds scale and transforming back to probability.

pred_probs <- predict(model_int_fixed, newdata = pred_grid, type = "link", se.fit = TRUE)

pred_grid$fit <- plogis(pred_probs$fit)
pred_grid$lwr <- plogis(pred_probs$fit - 1.96 * pred_probs$se.fit)
pred_grid$upr <- plogis(pred_probs$fit + 1.96 * pred_probs$se.fit)

ggplot(pred_grid, aes(x = Period, y = fit, color = Obesity_Status)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
geom_line(aes(group = Obesity_Status)) +
labs(
title = "Predicted Probability of Depression with 95% CI",
y = "Predicted Probability"
) +
theme_minimal()

4.3.2 Critical Limitations

Self-Reported Outcome: Depression history is self-reported and may include reporting bias; it also does not measure severity.

Reference-Level Predictions: Predicted probabilities are computed at a fixed reference age_group and sex, so absolute probability values differ for other demographic profiles.

Omitted Variables: This model controls only for age group and sex; additional covariates (income, race/ethnicity, exercise, alcohol use, comorbidities) may further explain depression risk and change the estimated obesity effect.

4.3.3 Recommendations for Future Work

Expanded Controls: Add socioeconomic and behavioral covariates to test robustness of the obesity effect.

Subgroup Analysis: Fit the model separately by sex or age_group to check for heterogeneous effects.

Survey Design Considerations: If survey weights are available in the BRFSS file, incorporate them to improve population-level inference.

5 References


- Centers for Disease Control and Prevention. (2023). BRFSS overview. https://www.cdc.gov/brfss/annual_data/2023/pdf/Overview_2023-508.pdf
- Centers for Disease Control and Prevention. (2021). Behavioral Risk Factor Surveillance System comparability of data BRFSS 2020. https://www.cdc.gov/brfss/annual_data/2020/pdf/compare-2020-508.pdf